GH-48868: [Doc] Document security model for the Arrow formats by pitrou · Pull Request #48870 · apache/arrow

pitrou · 2026-01-15T15:43:43Z

Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications; this is an attempt at documenting those.

What changes are included in this PR?

Add a Security Considerations page in the Format section.

Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

Are these changes tested?

N/A

Are there any user-facing changes?

No.

GitHub Issue: [Doc] Document security model for the Arrow formats #48868

pitrou · 2026-01-15T15:45:55Z

@github-actions crossbow submit preview-docs

github-actions · 2026-01-15T15:48:40Z

Revision: 593babb

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b

Task	Status
preview-docs

raboof

Looks reasonable (without any particular Arrow expertise)

(noticed two typos)


felipecrv · 2026-01-15T19:43:50Z

docs/source/format/Security.rst

+uninitialized in a buffer if the array might be sent to, or read by, a untrusted
+third-party, even when the uninitialized data is logically irrelevant. The
+easiest way to do this, though perhaps not the most efficient, is to zero-initialize
+any buffer that will not be populated in full.


It would be worth pointing out that query engines and dataframe libraries may decide not to do so for internal/intermediate values in computations, and instead apply a canonicalization pass when data leaves the system.

Perhaps we can emphasize that all bytes in an Arrow array, regardless of whether they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).

And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system.
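
A minimal sketch of such a canonicalization pass, in plain Python rather than any Arrow library. The helper names (`is_valid`, `canonicalize_values`) are hypothetical and not part of any Arrow API. It assumes a fixed-width array with a zero offset, a validity bitmap using least-significant-bit numbering, and a contiguous values buffer. It zeroes every value slot that the bitmap marks as null, as well as any bytes past the logical end of the array.

```python
def is_valid(validity: bytes, i: int) -> bool:
    """Return True if slot i is non-null (bitmap bits are numbered LSB-first)."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def canonicalize_values(values: bytearray, validity: bytes,
                        length: int, byte_width: int) -> None:
    """Zero, in place, every byte of `values` that does not back a valid slot.

    Hypothetical helper for illustration only.
    """
    if length < 0 or byte_width <= 0:
        raise ValueError("length must be >= 0 and byte_width must be > 0")
    if len(values) < length * byte_width:
        raise ValueError("values buffer is smaller than the array length requires")
    if len(validity) < (length + 7) // 8:
        raise ValueError("validity bitmap is smaller than the array length requires")

    # Clear the bytes under null slots: they are logically irrelevant,
    # but still readable by whoever receives the buffer.
    for i in range(length):
        if not is_valid(validity, i):
            start = i * byte_width
            values[start:start + byte_width] = bytes(byte_width)

    # Clear any bytes past the logical end of the array as well.
    tail = length * byte_width
    values[tail:] = bytes(len(values) - tail)
```

An engine that leaves such bytes uninitialized internally could run a pass of this kind once, at the point where arrays are handed to another system.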

felipecrv · 2026-01-15T19:50:20Z

docs/source/format/Security.rst

+from an untrusted source (for example because you are writing a proxy to
+an arbitrary third-party service), it is **recommended** that you validate
+the data first, as the consumer may assume that the data is valid already.
+


Suggested change

+In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the
+C data interface might be performing only very light validation of these values.
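
To illustrate the kind of check involved, here is a hedged sketch of offset validation for a variable-length binary layout with 32-bit little-endian offsets. The function name `check_offsets` is hypothetical and this is not how any particular Arrow implementation does it. It only verifies that the offsets buffer is large enough, that offsets start at a non-negative value, never decrease, and stay within the data buffer.

```python
import struct


def check_offsets(offsets_buf: bytes, length: int, data_len: int) -> None:
    """Raise ValueError if the int32 offsets cannot safely index the data buffer.

    Hypothetical helper: `length` is the number of array slots, so the
    offsets buffer must hold `length + 1` 32-bit values.
    """
    if length < 0:
        raise ValueError("negative array length")
    needed = (length + 1) * 4
    if len(offsets_buf) < needed:
        raise ValueError("offsets buffer too small for the declared length")

    offsets = struct.unpack_from(f"<{length + 1}i", offsets_buf)
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets are not monotonically non-decreasing")
    if offsets[-1] > data_len:
        raise ValueError("last offset points past the end of the data buffer")
```

A check like this is linear in the array length, which is one reason a producer might skip it on a hot path.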

alamb

Thank you @pitrou -- this is much needed and very helpful

I had some suggestions on structure. Hopefully they are helpful


alamb · 2026-01-16T11:34:01Z

docs/source/format/Security.rst

+Advice for users
+''''''''''''''''
+
+If you receive Arrow in-memory data from an untrusted source, it is


I suggest we also make the point about performance here to give context about why
validation is not always performed

Perhaps something like this:

"Arrow implementations often assume Arrays follow the specification
to provide high-speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.

Many Arrow implementations provide APIs to do such validation."

In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).

alamb · 2026-01-16T11:40:56Z

docs/source/format/Security.rst

+uninitialized in a buffer if the array might be sent to, or read by, a untrusted
+third-party, even when the uninitialized data is logically irrelevant. The
+easiest way to do this, though perhaps not the most efficient, is to zero-initialize
+any buffer that will not be populated in full.


Perhaps we can emphasize that all bytes in an Arrow array, regardless of whether they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).

And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system.

alamb · 2026-01-16T11:42:10Z

docs/source/format/Security.rst

+''''''''''''''''
+
+If you produce a C Data Interface structure for data that nevertheless comes
+from an untrusted source (for example because you are writing a proxy to


I don't think this is any different than the other APIs -- basically "if you don't trust the producer source, you should always explicitly validate the arrays before processing them"

This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files, etc.)

Hmm, that's true. I might just remove it if it muddies the message.

alamb · 2026-01-16T11:43:07Z

docs/source/format/Security.rst

+a trusted producer, for the reason explained above. However, it is still **recommended**
+that you validate it for soundness, as a trusted producer can have bugs anyway.
+
+IPC Format


As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.

"This" means the IPC format section?

Yes -- I was thinking that if the guidance is the same for IPC and C Data Interface, calling them out separately makes this documentation more verbose than it could be

The high level overview in my mind is:

If you read Arrow data (in any format) from an untrusted source, it can be a memory and security concern. Thus you should always verify such data.

Implementers should provide a way to let users opt in / out of the verification depending on the source and their security threat model.

If there are specific things to remember to check for particular formats, such as the FlatBuffers offsets for IPC-based formats, then we could call them out in those sections.

I am happy to propose some new wording if you like

Well, the guidance is not the same. For the C Data Interface, it is simply impossible (to my knowledge) to protect against rogue raw pointers. For IPC, validation is possible to ensure safe operation.

Yes, there's no way to meaningfully validate an ArrowArray other than perhaps to check pointers for unexpected NULL. This is because the ArrowArray does not provide buffer lengths (the consumer has to infer those based on the ArrowSchema and/or the buffer data of previous buffers). This is roughly the same as receiving an Arrow C++ or arrow-rs array from another library: the consumer has to assume it was correctly produced to avoid a crash (and a malicious consumer can always attempt to read byte ranges it isn't supposed to, based on the buffer pointers).
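
As a rough illustration of that inference, the sketch below computes how many bytes a consumer would expect to read from the two buffers of a fixed-width primitive array, given only the `length` and `offset` fields of the ArrowArray and a byte width derived from the schema. The function name is hypothetical. The result says what the consumer will read, not what the producer actually allocated, which is exactly why the pointers have to be trusted.

```python
def expected_primitive_buffer_sizes(length: int, offset: int,
                                    byte_width: int) -> tuple[int, int]:
    """Return (validity_bytes, values_bytes) a consumer expects to be readable.

    Hypothetical helper: the sizes are inferred, since the C Data Interface
    struct carries buffer pointers but no buffer lengths.
    """
    if length < 0 or offset < 0 or byte_width <= 0:
        raise ValueError("length and offset must be >= 0, byte_width must be > 0")
    slots = offset + length                 # slots addressed from the buffer start
    validity_bytes = (slots + 7) // 8       # one bit per slot, rounded up to bytes
    values_bytes = slots * byte_width       # one fixed-width value per slot
    return validity_bytes, values_bytes
```

For variable-length types the inference goes one step further: the size of the data buffer is only known after reading the last entry of the offsets buffer, so a wrong offset translates directly into an out-of-bounds read.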


pitrou · 2026-01-26T17:12:24Z

I've addressed most review comments and expanded the document quite a bit:

- did a bit of rewording, for hopefully better clarity
- added a paragraph about invalid values (such as invalid UTF-8 in a String array)
- added a section about deserialization of registered extension types
- added a section about robustness testing for implementations
- added a stub "non-Arrow formats" section

Another round of reviewing is welcome!

pitrou · 2026-01-26T17:12:53Z

@github-actions crossbow submit preview-docs

github-actions · 2026-01-26T17:15:02Z

Revision: 40c916b

Submitted crossbow builds: ursacomputing/crossbow @ actions-c3adcc06fb

Task	Status
preview-docs

pitrou · 2026-01-26T18:09:13Z

Rendered document preview at https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

pitrou · 2026-01-27T09:39:27Z

cc @paleolimbot @Alex-PLACET @wgtmac @lidavidm

pitrou · 2026-01-27T11:08:00Z

@github-actions crossbow submit preview-docs

pitrou · 2026-02-04T10:22:13Z

@github-actions crossbow submit preview-docs

github-actions · 2026-02-04T10:24:24Z

Revision: 131d70f

Submitted crossbow builds: ursacomputing/crossbow @ actions-4dc8dcf3d1

Task	Status
preview-docs

pitrou · 2026-02-04T11:51:18Z

It looks like the preview-docs build now fails for an unrelated reason.

In any case, do reviewers agree that this is good enough to go, or do you think there should be further changes?

alamb

Thank you @pitrou and other reviewers

I think this document is a great addition to the Arrow documentation and will be a great reference to refer people to

I left some small wording suggestions, but I don't think any of them are required to merge this PR.


alamb · 2026-02-05T10:49:21Z

docs/source/format/Security.rst

+1. *users* of Arrow: that is, developers of third-party libraries or applications
+   that use some of the Arrow formats or protocols by calling into Arrow libraries
+   as defined below;
+
+2. *implementors* of Arrow libraries: that is, libraries that provide APIs
+   abstraction away from the details of the Arrow formats and protocols; such
+   libraries include the official Arrow implementations documented on
+   https://arrow.apache.org, but not only.


Suggested change

-1. *users* of Arrow: that is, developers of third-party libraries or applications
-   that use some of the Arrow formats or protocols by calling into Arrow libraries
-   as defined below;
-
-2. *implementors* of Arrow libraries: that is, libraries that provide APIs
-   abstraction away from the details of the Arrow formats and protocols; such
-   libraries include the official Arrow implementations documented on
-   https://arrow.apache.org, but not only.
+1. *users* of Arrow: that is, developers of third-party libraries or applications
+   that consume data using Arrow formats or protocols created by another application.
+
+2. *implementors* of Arrow libraries: that is, libraries that provide APIs
+   abstraction away from the details of the Arrow formats and protocols; such
+   libraries include, but are not limited to, the official Arrow implementations documented on
+   https://arrow.apache.org.

Hmm, that changes the meaning quite a bit. I wanted to stress that "users" don't directly implement the Arrow specs, they use language-specific abstraction layers provided by an implementation. Perhaps I should just use these words :)

I've tried to improve the wording a bit.

The new wording looks good to me 👍

users of Arrow: that is, developers of third-party libraries or applications
that don't implement directly implement the Arrow formats or protocols, but
instead call language-specific APIs provided by an Arrow library
(as defined below);

alamb · 2026-02-05T10:57:33Z

docs/source/format/Security.rst

+.. TODO:
+   For each layout, we should list the associated security risks and the recommended
+   steps to validate (perhaps in Columnar.rst)


It seems like links to the invalid fuzz data have been added above.

I personally think this PR is already hugely valuable, even without such a list.

Thus I suggest we merge this PR and get it published. We can file an issue to track adding type specific security risks as a follow on


alamb · 2026-02-05T11:03:36Z

docs/source/format/Security.rst

+A typical validation API must return a well-defined error, not crash, if the
+given Arrow data is invalid; it must always be safe to execute regardless of
+whether the data is valid or not.
+


BTW the way the Rust API works is that by default, data read from IPC is explicitly validated (which is indeed quite expensive)

It is possible to turn this validation off via an unsafe API (the unsafe bit means the calling application has to explicitly disable the validation to acknowledge they are trusting the source):

https://docs.rs/arrow-ipc/57.2.0/arrow_ipc/reader/struct.FileDecoder.html#method.with_skip_validation
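
The same idea (validate by default, report a well-defined error, let the caller opt out explicitly) can be sketched in plain Python. This is an illustration only: `InvalidArrowData`, `validate_strings` and `read_strings` are hypothetical names, not the arrow-rs API or any other Arrow API, and the "array" here is reduced to a list of offsets plus a data buffer.

```python
class InvalidArrowData(ValueError):
    """Well-defined error reported for malformed input (hypothetical name)."""


def validate_strings(offsets: list[int], data: bytes) -> None:
    """Raise InvalidArrowData if offsets or UTF-8 contents are invalid."""
    if not offsets or offsets[0] < 0 or offsets[-1] > len(data):
        raise InvalidArrowData("offsets out of bounds")
    for start, end in zip(offsets, offsets[1:]):
        if end < start:
            raise InvalidArrowData("offsets are not monotonically non-decreasing")
    for start, end in zip(offsets, offsets[1:]):
        try:
            data[start:end].decode("utf-8")
        except UnicodeDecodeError as exc:
            raise InvalidArrowData(
                f"invalid UTF-8 in value starting at byte {start}") from exc


def read_strings(offsets: list[int], data: bytes, *,
                 skip_validation: bool = False) -> list[str]:
    """Decode string values; validation runs unless the caller opts out."""
    if not skip_validation:
        validate_strings(offsets, data)
    # With validation skipped, the caller is asserting that the source is trusted.
    return [data[s:e].decode("utf-8", errors="replace")
            for s, e in zip(offsets, offsets[1:])]
```

Making the opt-out a keyword-only argument plays a role loosely similar to the unsafe marker described above: skipping validation has to be spelled out at the call site.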

alamb · 2026-02-05T11:31:41Z

docs/source/format/Security.rst

+Advice for users
+----------------
+
+You should **never** consume a C Data Interface structure from an untrusted


This is good and clear 👍


pitrou · 2026-02-05T13:38:12Z

@github-actions crossbow submit preview-docs

github-actions · 2026-02-05T13:40:26Z

Revision: 61a3e81

Submitted crossbow builds: ursacomputing/crossbow @ actions-19f42b185d

Task	Status
preview-docs


paleolimbot

Thank you for these updates and putting this document together!

pitrou · 2026-02-05T15:43:42Z

Thanks a lot for the reviews!

github-actions bot added the Component: Documentation and awaiting review labels Jan 15, 2026

raboof reviewed Jan 15, 2026

github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 15, 2026

felipecrv reviewed Jan 15, 2026

github-actions bot added the awaiting changes label and removed the awaiting committer review label Jan 15, 2026

alamb reviewed Jan 16, 2026

raulcd reviewed Jan 16, 2026

github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 26, 2026

pitrou force-pushed the gh48868-format-security-model branch 2 times, most recently from 59c98d4 to 40c916b January 26, 2026 17:12

github-actions bot added the awaiting changes label and removed the awaiting change review label Jan 26, 2026

pitrou force-pushed the gh48868-format-security-model branch from 40c916b to 068cc96 January 26, 2026 17:44

github-actions bot added the awaiting change review label and removed the awaiting changes label Jan 26, 2026

pitrou force-pushed the gh48868-format-security-model branch from 068cc96 to 29ccd3b January 27, 2026 11:07

pitrou marked this pull request as ready for review January 27, 2026 11:08

lidavidm approved these changes Jan 27, 2026

github-actions bot added the awaiting change review label Feb 4, 2026

pitrou force-pushed the gh48868-format-security-model branch from 131d70f to f269af7 February 4, 2026 10:38

alamb mentioned this pull request Feb 5, 2026

Add fuzz regression testing to parquet/arrow/csv readers apache/arrow-rs#9358

Open

alamb approved these changes Feb 5, 2026

alamb requested a review from Alex-PLACET February 5, 2026 11:42

github-actions bot added the awaiting merge label and removed the awaiting change review label Feb 5, 2026

Alex-PLACET approved these changes Feb 5, 2026

GH-48868: [Doc] Document security model for the Arrow formats

61a3e81

pitrou force-pushed the gh48868-format-security-model branch from ebbe049 to 61a3e81 February 5, 2026 13:30

github-actions bot added the awaiting changes label and removed the awaiting merge label Feb 5, 2026

alamb reviewed Feb 5, 2026

Remove duplicate word

0546f05

github-actions bot added the awaiting change review label and removed the awaiting changes label Feb 5, 2026

paleolimbot approved these changes Feb 5, 2026

github-actions bot added the awaiting merge label and removed the awaiting change review label Feb 5, 2026

pitrou merged commit f39f275 into apache:main Feb 5, 2026
11 checks passed

pitrou removed the awaiting merge label Feb 5, 2026

pitrou mentioned this pull request Feb 5, 2026

[Doc] Document security model for the Arrow formats #48868

Closed

pitrou deleted the gh48868-format-security-model branch February 5, 2026 15:43

pitrou mentioned this pull request Feb 5, 2026

Announce Arrow security model apache/arrow-site#753

Open

2 tasks


